266 research outputs found

    Coupled Ensembles of Neural Networks

    Full text link
    We investigate in this paper the architecture of deep convolutional networks. Building on existing state-of-the-art models, we propose a reconfiguration of the model parameters into several parallel branches at the global network level, with each branch being a standalone CNN. We show that this arrangement is an efficient way either to significantly reduce the number of parameters without losing performance, or to significantly improve the performance for the same parameter budget. The use of branches brings an additional form of regularization. In addition to the split into parallel branches, we propose a tighter coupling of these branches by placing the "fuse (averaging) layer" before the Log-Likelihood and SoftMax layers during training. This gives another significant performance improvement, the tighter coupling favouring the learning of better representations, even at the level of the individual branches. We refer to this branched architecture as "coupled ensembles". The approach is very generic and can be applied with almost any DCNN architecture. With coupled ensembles of DenseNet-BC networks and a parameter budget of 25M, we obtain error rates of 2.92%, 15.68% and 1.50% respectively on the CIFAR-10, CIFAR-100 and SVHN tasks. For the same budget, DenseNet-BC has error rates of 3.46%, 17.18% and 1.8% respectively. With ensembles of coupled ensembles of DenseNet-BC networks, with 50M total parameters, we obtain error rates of 2.72%, 15.13% and 1.42% respectively on these tasks.
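
    The coupling point matters: the branches are fused by averaging before the Log-Likelihood/SoftMax stage, so every branch receives gradients through a single fused output. Below is a minimal PyTorch sketch of that idea; the module names, the `make_branch` factory and the choice to average log-probabilities are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of a "coupled ensemble" head, assuming PyTorch.
# Each branch is a standalone CNN; the fuse (averaging) layer sits
# before the final loss, so all branches are trained jointly.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoupledEnsemble(nn.Module):
    def __init__(self, make_branch, n_branches: int):
        super().__init__()
        # make_branch() returns any CNN mapping images -> class logits
        self.branches = nn.ModuleList(make_branch() for _ in range(n_branches))

    def forward(self, x):
        # Per-branch log-probabilities: (n_branches, batch, n_classes)
        logps = torch.stack([F.log_softmax(b(x), dim=1) for b in self.branches])
        # Fuse (averaging) layer BEFORE the loss; averaging raw logits
        # instead of log-probabilities is another possible variant.
        return logps.mean(dim=0)

# Training then uses F.nll_loss(model(x), targets), so gradients flow
# through the fused output into every branch simultaneously.
```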

    LIG and LIRIS at TRECVID 2008: High Level Feature Extraction and Collaborative Annotation

    Get PDF
    This paper describes the participation of LIG and LIRIS in the TRECVID 2008 High Level Features detection task. We evaluated several fusion strategies, especially rank fusion. Results show that including as many low-level and intermediate features as possible is the best strategy, that SIFT features are very important, that the way in which the fusion of the various low-level and intermediate features is performed matters, and that the type of mean (arithmetic, geometric or harmonic) matters as well. The best LIG and LIRIS runs have a Mean Inferred Average Precision of 0.0833 and 0.0598 respectively, both above the TRECVID 2008 HLF detection task median performance. LIG and LIRIS also co-organized the TRECVID 2008 collaborative annotation: 40 teams made 1,235,428 annotations. The development collection was annotated at least once at 100%, at least twice at 37.6%, at least three times at 3.99% and at least four times at 0.06%. Thanks to the active learning and active cleaning approach used, the annotations that were done multiple times were those for which the risk of error was highest.
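
    To make the comparison of means concrete, here is a minimal sketch of fusing per-feature detection scores with the three means mentioned above; the array layout (features × shots) and the score range are assumptions.

```python
# Sketch: fusing per-feature concept scores with different means, assuming
# `scores` has shape (n_features, n_shots) with values in (0, 1].
import numpy as np

def fuse(scores: np.ndarray, kind: str = "arithmetic") -> np.ndarray:
    if kind == "arithmetic":
        return scores.mean(axis=0)
    if kind == "geometric":
        # exp(mean(log(scores))): dominated by low scores, rewards consensus.
        return np.exp(np.log(scores).mean(axis=0))
    if kind == "harmonic":
        return scores.shape[0] / (1.0 / scores).sum(axis=0)
    raise ValueError(kind)

scores = np.random.uniform(0.01, 1.0, size=(5, 8))  # 5 features, 8 shots
fused = fuse(scores, "geometric")
```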

    Joint audio-visual words for violent scene detection in videos

    Get PDF
    This paper presents an audio-visual data representation for violent scene detection in movies. Existing work in this area considers visual information or audio information, or at most their classical fusion. Until now, few approaches have explored their mutual dependence for violent scene detection. We therefore propose a descriptor that provides joint audio and visual multimodal cues: first by assembling the audio and visual descriptors, then by statistically revealing the joint multimodal patterns. Experimental validation was carried out within the "Violent Scenes Detection" task of MediaEval 2013. The results show the potential of the proposed approach in comparison with methods using the audio and visual descriptors separately, or with other types of fusion.
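
    As a rough illustration of the two steps described above (assembling the descriptors, then revealing joint patterns), the sketch below builds a joint audio-visual codebook with k-means; the descriptor shapes, the clustering choice and the histogram representation are assumptions, not the paper's method.

```python
# Sketch of "joint audio-visual words", assuming aligned per-segment
# audio and visual descriptors as NumPy arrays with equal row counts.
import numpy as np
from sklearn.cluster import KMeans

def joint_av_words(audio: np.ndarray, visual: np.ndarray, n_words: int = 64):
    # 1) Assemble the two modalities into one joint descriptor per segment.
    joint = np.hstack([audio, visual])          # (n_segments, d_a + d_v)
    # 2) Reveal joint multimodal patterns statistically, here via k-means:
    #    each cluster centre acts as one joint audio-visual "word".
    #    (Assumes at least n_words segments are available.)
    km = KMeans(n_clusters=n_words, n_init=10).fit(joint)
    # A clip is then described by its histogram of joint words.
    hist = np.bincount(km.labels_, minlength=n_words).astype(float)
    return hist / hist.sum(), km
```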

    A factorized model for multiple SVM and multi-label classification for large scale multimedia indexing

    No full text
    This paper presents a set of improvements for SVM-based large-scale multimedia indexing. The proposed method is particularly suited to the detection of many target concepts at once and to highly imbalanced classes (very infrequent concepts). The method is based on the use of multiple SVMs (MSVM) for dealing with the class imbalance, and on adaptations of this approach that allow for an efficient implementation using optimized linear algebra routines. The implementation also involves hashed structures allowing the factorization of computations between the multiple SVMs and the multiple target concepts, and is denoted Factorized-MSVM. Experiments were conducted on a large-scale dataset, namely the TRECVid 2012 semantic indexing task. Results show that Factorized-MSVM performs as well as the original MSVM, but is significantly faster. Speed-ups by factors of several hundred were obtained for the simultaneous classification of 346 concepts, compared to the original MSVM implementation based on the popular libSVM library.
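
    The factorization can be illustrated with a small sketch: once the weight vectors of all SVMs for all concepts are stacked into one matrix, scoring every sample against every classifier becomes a single matrix product handled by optimized linear algebra routines. The shapes and the final averaging step are assumptions, not the paper's implementation.

```python
# Sketch of the factorization idea: one GEMM scores many linear SVMs
# for many concepts at once, so the shared work is done by BLAS.
import numpy as np

n_samples, dim = 10_000, 1_024
n_concepts, n_svms = 346, 5            # several SVMs per concept (MSVM)

X = np.random.randn(n_samples, dim).astype(np.float32)
# Stack every SVM of every concept into a single weight matrix.
W = np.random.randn(dim, n_concepts * n_svms).astype(np.float32)
b = np.random.randn(n_concepts * n_svms).astype(np.float32)

# One matrix product computes all decision values at once ...
scores = X @ W + b                      # (n_samples, n_concepts * n_svms)
# ... then the multiple SVMs of each concept are combined, e.g. averaged.
scores = scores.reshape(n_samples, n_concepts, n_svms).mean(axis=2)
```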

    A linguistic-pattern approach to automatic speaker detection: application to content-based indexing of television news

    No full text
    The identity of people in audiovisual documents is an important piece of semantic information for a content-based indexing and retrieval process. The speaker identity detection task can be carried out by exploiting pieces of information from different modalities (text, image and sound). In this article, we propose an approach for indexing speaker identity in television news by exploiting the audio content. After a speaker segmentation phase, an identity is assigned to speech segments through linguistic patterns applied to their transcripts produced by speech recognition. Three types of patterns are used to predict the speaker identity in the previous, current or following segments. These predictions are then propagated to other segments by acoustic-level similarity. Evaluations were conducted on part of the TREC 2003 corpus: a speaker identity could be assigned to 53% of the annotated corpus with a precision of 82%.
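
    A minimal sketch of the three pattern types is given below; the regular expressions, the English phrasings and the name format are purely illustrative assumptions (the paper works on French broadcast news transcripts).

```python
# Sketch: linguistic patterns over an ASR transcript predicting the
# speaker of the previous, current or next segment.
import re

PATTERNS = {
    # The named person speaks in the NEXT segment.
    "next":     re.compile(r"[Oo]ver to you,? (?P<name>[A-Z][a-z]+ [A-Z][a-z]+)"),
    # The named person is speaking in the CURRENT segment.
    "current":  re.compile(r"I am (?P<name>[A-Z][a-z]+ [A-Z][a-z]+)"),
    # The named person spoke in the PREVIOUS segment.
    "previous": re.compile(r"[Tt]hank you,? (?P<name>[A-Z][a-z]+ [A-Z][a-z]+)"),
}

def predict_speakers(transcript: str):
    return [(kind, m.group("name"))
            for kind, pat in PATTERNS.items()
            for m in pat.finditer(transcript)]

print(predict_speakers("Thank you, John Smith. Over to you, Jane Doe."))
# [('next', 'Jane Doe'), ('previous', 'John Smith')]
```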

    Descriptor Optimization for Multimedia Indexing and Retrieval

    No full text
    In this paper, we propose and evaluate a method for optimizing descriptors used for content-based multimedia indexing and retrieval. A large variety of descriptors are commonly used for this purpose. However, the most efficient ones often have characteristics preventing them from being easily used in large-scale systems. They may have very high dimensionality (up to tens of thousands of dimensions) and/or be suited to a distance that is costly to compute (e.g. chi-square). The proposed method combines a PCA-based dimensionality reduction with pre- and post-PCA non-linear transformations. The resulting transformation is globally optimized. The produced descriptors have a much lower dimensionality while performing at least as well, and often significantly better, with the Euclidean distance than the original high-dimensionality descriptors with their optimal distance. The method was validated and evaluated for a variety of descriptors using TRECVid 2010 semantic indexing task data. It was then applied at large scale for the TRECVid 2012 semantic indexing task on tens of descriptors of various types, with initial dimensionalities ranging from 15 up to 32,768. The same transformation can also be used for multimedia retrieval in the context of query by example and/or relevance feedback.
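
    A hedged sketch of such a pipeline is shown below: a pre-PCA non-linearity (here a fixed power normalization), a PCA projection, and a post-PCA transformation (here L2 normalization), after which plain Euclidean distance is used. In the paper the whole transformation is globally optimized; the fixed exponent and dimensions here are assumptions.

```python
# Sketch: pre-PCA non-linear map -> PCA reduction -> post-PCA map,
# producing compact, Euclidean-comparable descriptors.
import numpy as np
from sklearn.decomposition import PCA

def transform(D: np.ndarray, out_dim: int = 256, alpha: float = 0.5):
    D = np.sign(D) * np.abs(D) ** alpha            # pre-PCA non-linearity
    D = PCA(n_components=out_dim).fit_transform(D) # dimensionality reduction
    D /= np.linalg.norm(D, axis=1, keepdims=True) + 1e-12  # post-PCA map
    return D

high_dim = np.random.rand(1000, 32768).astype(np.float32)  # e.g. a BoW descriptor
compact = transform(high_dim)   # (1000, 256), compared with Euclidean distance
```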

    Semantic Video Content Indexing and Retrieval using Conceptual Graphs

    No full text
    In this article, we propose a conceptual model for video content description. This model is an extension of the EMIR² model proposed for image representation and retrieval. The proposed extensions include the addition of views such as the temporal and event views that are specific to video documents, the extension of the structural view to the temporal structure of video documents, and the extension of the perceptive view to motion descriptors. We have kept the formalism of conceptual graphs for the representation of the semantic content. The various concepts and relations involved can be taken from general and/or domain-specific ontologies and completed by lists of instances (individuals). The proposed model has been applied to the TREC video 2002 and 2003 corpora, which mainly contain TV news and commercial videos.
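
    As a loose illustration only, a conceptual-graph description can be encoded as typed relations between concept nodes, with retrieval as subgraph containment; the encoding below is an assumed simplification, not the EMIR² formalism itself, and the concepts and relation names are made up.

```python
# Sketch: a conceptual-graph-like description of a shot as typed
# relations between concepts, with query matching as containment.
from dataclasses import dataclass

@dataclass(frozen=True)
class Relation:
    subject: str   # concept or instance, e.g. "Person:anchor"
    relation: str  # e.g. "agentOf", "locatedIn", "before"
    obj: str

shot_description = [
    Relation("Person:anchor", "agentOf", "Event:report"),
    Relation("Event:report", "locatedIn", "Place:studio"),
    Relation("Event:report", "before", "Event:commercial"),  # temporal view
]

def matches(graph, query):
    # A query graph matches if all its relations appear in the document graph.
    return all(q in graph for q in query)

query = [Relation("Event:report", "locatedIn", "Place:studio")]
print(matches(shot_description, query))  # True
```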

    Video annotation with rare concept pairs

    No full text
    Detecting a visual concept in videos is a difficult task, especially for rare concepts or for those that are hard to describe visually. The question becomes even harder when we want to detect a pair of concepts instead of a single one. Indeed, the more concepts are present in a video scene, the more visually complex the scene is, and the harder it becomes to find a specific description for it. Two main directions can be followed to tackle this problem: 1) detect each concept separately and then combine the predictions of the corresponding detectors, in a way similar to what is often done in information retrieval, or 2) consider the pair as a new concept and train a supervised classifier for this new concept, inferring new annotations from those of the two concepts forming the pair. Each of these approaches has its advantages and drawbacks. The major problem of the second method is the need for an annotated data set, especially for the positive class. If the concepts are rare, this rarity grows even more for the pairs formed from their combinations. On the other hand, two concepts may each be fairly frequent while very rarely occurring together in the same document. Some state-of-the-art works have proposed to overcome this problem by collecting representative examples of the studied classes from the web, but this task remains costly in time and money. We compared the two types of approaches without resorting to external resources. Our evaluation was carried out within the "concept pair detection" subtask of the semantic indexing (SIN) task of TRECVID 2013, and the results showed that for videos, when no external information resources are used, the approaches that fuse the results of the two detectors perform better, contrary to what had been shown in previous work for the case of still images. The performance of the described methods exceeds the best official result of the aforementioned evaluation campaign by 9% in terms of relative gain in mean average precision (MAP).
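
    The first direction (fusing the outputs of the two single-concept detectors) can be sketched as follows; the combination rules and the example concepts are assumptions, shown only to make the fusion idea concrete.

```python
# Sketch: combining per-concept detector scores into a pair score,
# IR-style, for ranking video shots by presence of both concepts.
import numpy as np

def pair_score(s_a: np.ndarray, s_b: np.ndarray, how: str = "product"):
    if how == "product":        # probabilistic AND under independence
        return s_a * s_b
    if how == "min":            # weakest-link combination
        return np.minimum(s_a, s_b)
    if how == "harmonic":       # penalizes imbalance between the two scores
        return 2 * s_a * s_b / (s_a + s_b + 1e-12)
    raise ValueError(how)

s_person, s_boat = np.random.rand(100), np.random.rand(100)  # hypothetical pair
ranking = np.argsort(-pair_score(s_person, s_boat, "product"))
```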

    Temporal re-scoring vs. temporal descriptors for semantic indexing of videos

    No full text
    The automated indexing of images and videos is a difficult problem because of the "distance" between the arrays of numbers encoding these documents and the concepts (e.g. people, places, events or objects) with which we wish to annotate them. Methods exist for this, but their results are far from satisfactory in terms of generality and accuracy. Existing methods typically learn a concept from a single set of annotated examples and consider that set as uniform. This is not optimal because the same concept may appear in various contexts, and its appearance may be very different depending upon these contexts. Context has been widely used in the state of the art to address various problems. However, the temporal context seems to be the most crucial and the most effective in the case of videos. In this paper, we present a comparative study between two methods exploiting the temporal context for semantic video indexing. The proposed approaches use temporal information derived from two different sources: low-level content and semantic information. Our experiments on the TRECVID 2012 collection showed interesting results that confirm the usefulness of the temporal context and demonstrate which of the two approaches is more effective.
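
    As an illustration of the temporal re-scoring idea, the sketch below re-estimates each shot's concept score from its temporal neighbours, under the assumption that a concept present in one shot is likely present in adjacent ones; the kernel shape and window size are assumptions.

```python
# Sketch: temporal re-scoring of per-shot concept scores by weighted
# averaging over a symmetric window of neighbouring shots.
import numpy as np

def temporal_rescore(scores: np.ndarray, window: int = 2) -> np.ndarray:
    # scores: (n_shots,) initial detector outputs, in temporal order.
    weights = np.array([0.5 ** abs(k) for k in range(-window, window + 1)])
    weights /= weights.sum()
    padded = np.pad(scores, window, mode="edge")
    # Slide the normalized kernel over the padded score sequence.
    return np.correlate(padded, weights, mode="valid")

shot_scores = np.random.rand(20)
smoothed = temporal_rescore(shot_scores)   # same length, temporally smoothed
```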
    • 

    corecore